The time-series line in the heading above is derived from current incidence cases of COVID-19 in Australia.
This web site is a joint effort of researchers at the South Western Sydney Clinical School and the Centre for Big Data Research in Health at the UNSW Faculty of Medicine, the Econometrics and Business Statistics Research Group of Monash University, and at the Ingham Institute for Applied Medical Research in Liverpool, Sydney.
The intent is to offer a range of principled epidemiological and statistical analyses and visualisations of current COVID-19 data which go beyond the now ubiquitous world maps and cumulative incidence charts.
The broad themes for the analyses and visualisations currently available are listed in the menu at the top of this page – more will be added in due course. For each theme there is an introductory page explaining the motivating ideas and methodology employed for each of the visualisations or analyses for that theme, which are available in the subsequent frames (the series of rectangles at the top of each page). Additional notes or commentary appear on the right of some pages.
This site has been created by researchers at Australian universities, and hence the focus is on the situation in Australia, within the broader international context – we are, after all, all in this together. However, we hope that some of the analyses and visualisations on this site might be useful elsewhere, and to that end, all the R source code used to create this site is freely available – please see the Technical details tab above for details on the software used, and where to find the source code.
The content on this site is licensed under a Creative Commons Attribution 4.0 International License. If you wish to re-use any of the content, please retain the attribution which appears on each chart or set of charts, or otherwise provide that attribution alongside the content that you re-use.
In turn, we acknowledge RECON for providing financial support, and the European Centre for Disease Prevention and Control (European CDC), which has provided up-to-date data on the COVID-19 pandemic.
The project has been led by Timothy Churches and Nicholas Tierney, with thanks to Stuart Lee, Dianne Cook, Miles McBain, and Rob Hyndman.
Visualisations and analyses are made available in the covidrecon package.
Source code for this visualisation is available at covid-flexdashboard.
The data analysis has been carried out entirely within the R programming language using RStudio.
Packages used include:
Incidence is the epidemiological term for the number of cases of a disease meeting some case definition in a specified time period. Here we present the daily incidence – that is, the number of new cases each day – of COVID-19 for a range of countries, including Australia, of course. Each country may be using a slightly different case definition, although most of the countries presented here have been using case definitions aligned with those recommended by the WHO. Three types of case definition are used, simultaneously:
All of the data reported here are for confirmed cases, with a few exceptions: China included cases diagnosed via lung CT scan for a few days in February 2020, but we have adjusted the data for those days as far as possible to remove that anomaly.
The data are drawn from the European CDC, which has collected data from various governing agencies around the world. There are some small discrepancies between these data and those given by the Australian government, and indeed other national governments, due to timing issues relating to when cases are tabulated each day, and so on. However, we believe the European CDC to be the best source of automatically downloadable, machine-readable data there is right now.
Note, however, that because the confirmed case definition depends on laboratory confirmation, it is influenced by the number of lab tests (RT-PCR, reverse transcriptase polymerase chain reaction tests on nasopharyngeal swabs or sputum) done by each reporting country or jurisdiction. Obviously the number of reported cases is bounded by the number of such lab tests which are done, but the degree of under-ascertainment is also affected by the policies in place which determine which potential COVID-19 cases are tested. These policies are completely country-specific and have changed over time.
Another issue with these data is that it is vastly preferable to analyse incidence by (presumed or definite) date of onset of symptoms, rather than by date of notification or date of reporting. There may be variable delays in the processing of laboratory tests and the reporting of cases to central authorities. Tabulation of incidence counts by date of onset overcomes this problem. Almost all national and jurisdictional health authorities will be collecting data on presumed date of onset for each case (although, inexplicably, it is not one of the data items on the WHO-recommended data collection form). One reason given for not using date of onset is that it may be incomplete, but this ignores the fact that statistical imputation can be used to validly fill in those missing dates of onset. There is also a body of methods, collectively known as nowcasting, that use multivariate time-series models to convert data tabulated by date of notification/reporting to (estimated) date of onset. If national or jurisdictional health authorities do not have the technical capacity to undertake such value-adding to their own data, they should make the required data available to trusted partners in the academic sector who can undertake such statistical manipulation for them (or help authorities implement such processing internally).
Cumulative incidence is, as the name suggests, just the cumulative sum of the daily incidence - that is, a running total of the number of cases. Reporting of cumulative COVID-19 incidence seems to dominate the mainstream media, but it has many disadvantages. In particular, a cumulative sum of case counts is always monotonically increasing – it can only ever go up, or at best, remain flat if there are no new incident cases. This tends to obscure the rate of change in incidence over time – subtle, or even large changes in the slope of the cumulative incidence curve are difficult to see.
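The relationship between daily and cumulative incidence can be illustrated with a minimal Python sketch. The site itself is built in R; the counts below are made-up numbers for illustration, not real data.

```python
from itertools import accumulate

# Hypothetical daily counts of new cases (illustrative only, not real data).
daily_incidence = [2, 5, 9, 14, 10, 6, 3, 0, 0]

# Cumulative incidence is the running total of the daily incidence.
cumulative_incidence = list(accumulate(daily_incidence))

# The running total never decreases, even while daily counts are falling.
is_monotonic = all(
    later >= earlier
    for earlier, later in zip(cumulative_incidence, cumulative_incidence[1:])
)

print(cumulative_incidence)
print(is_monotonic)
```

In this example the daily counts peak on the fourth day and then decline to zero, yet the running total keeps rising until the new cases stop, which is why the decline is hard to see on a cumulative chart.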
The first chart presented here is a semi-log cumulative incidence chart. This chart seems to have been popularised by John Burn-Murdoch in the Financial Times, but it appears it was first used by Matt Cowgill from Australia’s very own Grattan Institute. It has since been widely copied and reproduced. Please see a blog post by Prof Rob Hyndman for further discussion of this chart, and some alternative analyses.
Several variations on the Grattan chart are presented; please see the notes for each one.
The epicurve is perhaps the most-used chart in field epidemiology and outbreak control. It is simply a chart of daily (or weekly, for slower-moving diseases) incidence (new cases), traditionally shown as a bar chart. It gives an immediate sense of whether an outbreak or epidemic is in a growth phase, with increasing incident counts each successive day, or in a decay phase, with decreasing counts each successive day. Note that the cumulative count will still increase, day-on-day, even when an outbreak or epidemic is in a decay phase. Only when the epidemic has been completely extinguished will the cumulative incidence stop increasing. That’s one of the reasons why cumulative incidence charts are rarely used by epidemiologists.
Three variations of epicurves are provided here:
This chart shows the cumulative cases of COVID-19 for selected countries on a logarithmic y-axis scale, with the dates on the x-axis converted to the number of days since each country shown exceeded 100 cumulative cases, on a linear x-axis scale (hence the name semi-log, since only one of the two axes is logarithmic).
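The x-axis alignment described above can be sketched in a few lines of Python. This is an illustration only, not the site's R code; the function name `align_to_threshold` and the example series are hypothetical.

```python
def align_to_threshold(cumulative, threshold=100):
    """Return (days_since_threshold, cumulative_count) pairs, starting from
    the first day on which the cumulative count exceeds `threshold`."""
    for start, count in enumerate(cumulative):
        if count > threshold:
            return list(enumerate(cumulative[start:]))
    return []  # the threshold was never exceeded


# Two hypothetical countries whose epidemics started on different calendar days.
country_a = [20, 60, 101, 150, 230]
country_b = [5, 90, 100, 130]

print(align_to_threshold(country_a))
print(align_to_threshold(country_b))
```

After alignment, day 0 means the same thing for every country, so the trajectories can be compared directly. The counts would then be drawn on a logarithmic y-axis.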
Note that countries “peel off” the diagonal trajectory as their rate of new (incident) cases reduces. If the line for a country is horizontal, it means there are no new cases occurring there.
The curve for Australia is clearly flattening, and we are keeping good company with China, South Korea and New Zealand as the other countries with nearly horizontal trajectories. Note that after considerable initial success in containing COVID-19 spread, both Japan and Singapore are now on an upward trajectory, but the slopes of those trajectories are much shallower than those of the other countries shown, indicating much slower spread in those countries.
This chart is a variation on the previous Grattan Institute chart. It isn’t really an improvement on the Grattan chart, but is shown here to illustrate the sensitivity of the Grattan chart to the method used to align the dates on the x-axis. In the chart shown at left, the dates on the x-axis are aligned to the approximate start of the COVID-19 epidemic in each country. The start dates are chosen automatically using a non-parametric changepoint detection algorithm (see the changepoint.np package for R). The changepoints for each country shown are as follows:
| Country | Detected start of epidemic |
|---|---|
| China | 2020-01-18 |
| South Korea | 2020-01-23 |
| Taiwan | 2020-01-24 |
| Singapore | 2020-01-27 |
| Japan | 2020-02-12 |
| Italy | 2020-02-21 |
| Spain | 2020-02-24 |
| Germany | 2020-02-26 |
| France | 2020-02-26 |
| Norway | 2020-02-26 |
| Sweden | 2020-02-26 |
| USA | 2020-02-26 |
| Canada | 2020-02-27 |
| UK | 2020-02-27 |
| Australia | 2020-02-28 |
| New Zealand | 2020-03-18 |
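To give a feel for what changepoint detection does, here is a deliberately simplified Python sketch. It is not the non-parametric algorithm from changepoint.np used for the table above. It swaps in a much cruder technique: finding the single split point that minimises the squared error around two segment means. The function name and the counts are hypothetical.

```python
def single_changepoint(series):
    """Return the index k (1 <= k < len(series)) that best splits `series`
    into two segments, each summarised by its own mean, by minimising the
    total squared error. A crude stand-in for a real changepoint method."""

    def squared_error(segment):
        mean = sum(segment) / len(segment)
        return sum((value - mean) ** 2 for value in segment)

    best_index, best_cost = None, float("inf")
    for k in range(1, len(series)):
        cost = squared_error(series[:k]) + squared_error(series[k:])
        if cost < best_cost:
            best_index, best_cost = k, cost
    return best_index


# Hypothetical daily counts: sporadic cases, then sustained growth.
daily_counts = [0, 1, 0, 2, 1, 0, 1, 30, 42, 55, 61, 70]
start_index = single_changepoint(daily_counts)
print(start_index)
```

The returned index marks the first day of the second segment, which plays the role of the detected start of the epidemic in this toy setting.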
This is another variation on the Grattan Institute chart, but this time showing daily incidence on the y-axis, rather than daily cumulative incidence. It is basically the same information as shown in the epicurve charts in the subsequent frames (accessed via the blue rectangles at the top of the page), but all of the selected countries are shown on one plot, and the dates are aligned by the detected start of the epidemic in each country.
One problem is that the daily incidence curves are rather noisy. We can address that by smoothing them, as shown in the next frame.
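The text does not state which smoother the site uses, so the following Python sketch assumes the simplest option, a trailing moving average. The function name `moving_average` and the window length are assumptions made for illustration.

```python
def moving_average(values, window=7):
    """Trailing moving average. Early entries are averaged over however
    many days are available so far, so the output has the same length."""
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Hypothetical noisy series that alternates between low and high counts.
noisy = [3, 9, 3, 9, 3, 9]
print(moving_average(noisy, window=2))
```

With a two-day window the day-to-day alternation averages out to a flat line after the first day. A longer window smooths more, but responds more slowly to genuine changes in incidence.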
In this chart, we can now discern three distinct groups of countries:
This chart replicates a chart developed by Aatish Bhatia in collaboration with Minute Physics. The plot shows, for each country, the number of incident (new) cases on the y-axis, on a logarithmic scale, versus the cumulative total of cases on the x-axis, also on a logarithmic scale. When plotted in this way, uncontrolled epidemics with exponential growth take a linear (straight line) trajectory at some angle. As an epidemic is brought under control, the trajectory drops below the straight line, eventually falling vertically if the epidemic has been completely extinguished.
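Why exponential growth appears as a straight line on this chart can be checked numerically. The Python sketch below uses a hypothetical constant daily growth factor, not real data. It shows that new cases become a fixed fraction of the cumulative total, so on log-log axes the trajectory has a slope close to 1.

```python
from itertools import accumulate
import math

growth = 1.2  # hypothetical daily growth factor (20% more new cases each day)
daily = [10 * growth ** day for day in range(30)]
cumulative = list(accumulate(daily))

# Chart coordinates: log10(cumulative total) on x, log10(new cases) on y.
points = [(math.log10(c), math.log10(d)) for c, d in zip(cumulative, daily)]

# Slope of the trajectory between the last two days.
(x0, y0), (x1, y1) = points[-2], points[-1]
slope = (y1 - y0) / (x1 - x0)

# Under sustained geometric growth, new cases approach a constant
# fraction (growth - 1) / growth of the cumulative total.
ratio = daily[-1] / cumulative[-1]

print(round(slope, 3), round(ratio, 3))
```

Once growth slows, new cases fall while the cumulative total barely moves, so the point drops almost straight down from that line.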
This is just an alternative visualisation of the data shown in earlier frames, but it looks pretty and the animation is helpful, because time is not one of the explicit dimensions of the chart.
The charts shown here are epicurves – that is, daily count of incident (new) cases, using data collated by the European CDC. It is easy to see when (or whether) the epidemic in each country entered a decay phase and the daily incidence started to decline (although the cumulative incidence will, by definition, continue to increase during the decay phase until the epidemic is completely extinguished).
Note that there are some days where the number of cases appears to spike upwards, followed by a decrease the following day. This indicates that there may be some data discrepancies in how the European CDC is capturing data from WHO Situation Reports. It underlines the importance of nations providing reliable machine-readable access to their own COVID-19 data. By “machine-readable” we mean CSV or JSON data files which are automatically downloadable, or an API which can be queried automatically to yield such data. Neither of those are difficult to establish, yet nearly all national governments have failed to provide such data, leaving it to third-party agencies and citizen-science efforts to piece together the required data in a manner that permits ongoing analysis. There is, for example, no official machine-readable source of national COVID-19 data provided by the Australian government. As far as we are aware, NSW is the only State or Territory government that has made any effort in that direction by providing some machine-readable data, which we leverage in the \(R_{t}\) for NSW theme (see menu above).
Forcing the y-axis scale to be the same for all the plots in this chart means that, compared to the previous chart where the y-axis could change for each country, the country with the largest number of cases (in this case, the USA) appears the same, and the rest of the plots appear smaller.
This provides important context: the number of cases in the USA currently dominates relative to other countries. The bottom row of countries are barely visible, by comparison.
The final variation of the epicurve chart, shown here, uses a logarithmic y-axis scale. This emphasises lower counts, which is useful for inspecting the beginnings and ends of the epicurves, or the middle parts where there is more than one “wave”, as in China and Singapore.
A key statistic in field epidemiology is the basic reproduction number, often referred to as “R-zero” or “R-nought”, and written \(R_{0}\). This is the number of people we expect each new case of a disease such as COVID-19 to infect – in other words, how many people each case passes the disease on to – at the beginning of a disease outbreak or epidemic, before any public health controls or interventions have been established or put in place.
\(R_{0}\) was popularised in the 2011 film Contagion. Here is Kate Winslet, who plays a US CDC epidemiologist, explaining \(R_{0}\) (with a few errors – it’s reproduction number, not “reproductive rate” as she says) to government officials:
As explained in the film clip above, \(R_{0}\) is determined by a number of factors, but is a very useful metric of how fast a communicable disease will spread in a population, if nothing is done to stop or slow it. Note that \(R_{0}\) is not solely a characteristic of a pathogen (such as a virus) per se. It depends on the nature and biological behaviour of the pathogen, but also on how it is transmitted, how often members of the population engage in behaviours that may transmit it, population density, whether any members of the population are immune or resistant to the disease, and so on. In other words, \(R_{0}\) is highly situation-specific.
As such, \(R_{0}\) is useful in the early stages of an epidemic, but as soon as (effective) public health interventions have been put in place to slow the spread of the disease, the reproduction number will change. This metric is referred to as the effective reproduction number, denoted \(R_{t}\), where \(t\) stands for time. In other words, \(R_{t}\) is the reproduction number at a particular point in time, \(t\).
If \(R_{t}\) equals 1.0
If \(R_{t}\) is greater than 1.0
If \(R_{t}\) is less than 1.0
We estimate the effective reproduction number \(R_{t}\) using a statistical model which was developed in 2013 by Anne Cori and colleagues at Imperial College London (ICL). The method was later extended by Thompson and colleagues (including Anne Cori), also at ICL. We won’t go into the details of the model here – they are described in detail in the cited papers – but we will remark on a few key points that need to be borne in mind when interpreting these statistics.
The methods we use here estimate the instantaneous effective reproduction number, which is…
The reproduction number is an estimate of the spread of an infection in a specific, local population, and thus we should really only use counts of incident (new) infections which are the result of local spread – that is, we should not include cases which were acquired elsewhere, such as overseas, when estimating it. Doing so will bias our estimates of \(R_{t}\) upwards. In fact, the estimation method used here can make proper use of both incident counts of imported cases and separate counts of locally-acquired cases. Unfortunately, very few jurisdictions have released data which enable that distinction between imported and locally-acquired cases to be made, and so we just use total incidence for the estimates shown in this section. However, incident counts of cases, split into imported and locally-acquired counts, are available from NSW Health, and we report on those much better data in the next section (see menu at top of page).
We use a seven-day trailing sliding window of incidence case counts to calculate the current \(R_{t}\) estimate. Thus, for each date, the estimate is based on the count of incident cases on that date and for the six days prior to that. This is done to smooth the estimates and provide greater precision for each estimate.
The estimation procedure uses Bayesian methods, and we display the median estimate of the posterior distribution of the estimated \(R_{t}\), as well as the 95% credible interval in some of the charts.
Estimation of the effective reproduction number should really use incidence data tabulated (aggregated) by date of symptom onset, not date of reporting or date of notification (to the relevant health authority). Please see the explanatory notes in the previous section on incidence for further discussion of this data gap, and how health authorities could readily address it.
Estimates of the reproduction number depend critically on using the correct distribution for the serial interval. The serial interval is the interval between the onset of symptoms in a case and the onset of symptoms in those infected by that case: in other words, the difference between the time (or date) of onset of an infector and the times or dates of onset of its infectees. Not every pair of infector-infectee cases will have the same serial interval, even for the same infector case. Thus, the serial interval is not a single number, or even a range, but rather a statistical distribution. It is often expressed or summarised as the mean and standard deviation of an idealised distribution, typically a discrete gamma distribution, a discrete Weibull distribution or a log-normal distribution. In the charts in this section, we have used samples from the Bayesian posterior distribution for the serial interval derived from the observed serial interval pairs for COVID-19 published by Nishiura et al. Note that the resulting set of samples has a mean which is shorter than most estimates of the incubation period of COVID-19, which is indicative of pre-symptomatic transmission – that is, some cases of COVID-19 infection transmit the infection to others before they themselves start to experience symptoms of illness. That also explains the real difficulties in controlling the spread of the SARS-CoV-2 virus which causes COVID-19. In the next section, \(R_{t}\) for NSW, we also use an alternative method for estimating the serial interval distribution from empirical data.
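To make the idea of a serial interval distribution concrete, the Python sketch below turns a mean and standard deviation into a discrete probability for each whole-day lag, assuming a log-normal shape. It is a simplified illustration, not the posterior-sampling approach described above. The function name, the rounding-to-nearest-day discretisation, and the mean and standard deviation used are all assumptions chosen for this example.

```python
import math


def discretised_lognormal(mean, sd, max_days=30):
    """Probability that the serial interval is 1, 2, ..., max_days days,
    for a log-normal distribution with the given mean and sd (in days).
    Each day k receives the mass on (k - 0.5, k + 0.5], then the masses
    are renormalised to sum to 1."""
    sigma_sq = math.log(1 + (sd / mean) ** 2)
    mu = math.log(mean) - sigma_sq / 2
    sigma = math.sqrt(sigma_sq)

    def cdf(x):
        if x <= 0:
            return 0.0
        return 0.5 * (1 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2))))

    masses = [cdf(day + 0.5) - cdf(day - 0.5) for day in range(1, max_days + 1)]
    total = sum(masses)
    return [mass / total for mass in masses]


# Illustrative parameters (assumed for this sketch): mean 4.7 days, sd 2.9 days.
weights = discretised_lognormal(mean=4.7, sd=2.9)
discrete_mean = sum(day * w for day, w in enumerate(weights, start=1))
print(round(sum(weights), 6), round(discrete_mean, 2))
```

The resulting list of weights is the kind of object an \(R_{t}\) estimator consumes: one probability per lag in days, rather than a single number.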
Further discussion of serial interval estimation and calculation of the effective reproduction number can be found in this technical blog post.
The \(R_{t}\) estimates for each country indicate that most countries are bringing the spread of the virus under control, although there is wide variation between countries. Currently only Singapore and Japan (amongst these selected countries) are experiencing rising \(R_{t}\). The estimate of \(R_{t}\) for Australia is now well below 1.0, indicating that the epidemic is in decay.
This is the same information presented as in the previous graphic, but with each country split into its own graph. This allows us to see the trajectory of effective R.
Most countries are decreasing, but we notice that Japan and Singapore are fluctuating, due to recent outbreaks.
This shows the uncertainty around the median estimates of the effective reproduction number. The grey bands either side of the median estimate indicate the 95% credible interval, that is, the range in which we are 95% certain the true value lies (note: these are Bayesian, not frequentist estimates).
The COVID-19 epidemic in Australia appears to have been in decay since…
The estimates here use samples from a posterior distribution of serial intervals estimated from the data given by Nishiura et al. using the method of Thompson et al.
The charts of the time-varying effective reproduction number \(R_{t}\) for New South Wales COVID-19 incidence data are similar to the analyses presented in the previous section, with the important distinction that incident cases have been divided into locally-acquired cases and cases where the infection was acquired elsewhere (overseas or interstate). This permits much more accurate estimates of the true \(R_{t}\).
This has been enabled by the publication of more detailed data by NSW Health, available on the NSW government open data web site, and updated daily, in conjunction with data derived by counting pixels in NSW Health charts to obtain earlier data (luckily the charts were high-resolution and thus the data extracted this way are very accurate).
In each of the frames in this section, estimates of the NSW effective reproduction number are shown, based on either a parametric serial interval distribution, or a serial interval distribution derived from data collated by Nishiura et al. using the method of Thompson et al.
Frames in which imported and locally-acquired cases are not distinguished in the estimates are also provided for comparison. These are much less likely to be correct than the estimates using split imported/locally-acquired counts.
In addition, frames in which the counts of cases in NSW have been inflated by a factor of 10 for local cases, and by 50% for imported cases, are also presented. This is a sensitivity analysis which mimics the situation in which there is considerable under-ascertainment of cases – that is, only 1 in 10 cases in the community are actually detected.
The total number of incident cases arising at timestep \(t\), \(I_t\), is the sum of the numbers of incident local (\(I_{t}^{local}\)) and imported (\(I_{t}^{imported}\)) cases,
\[ I_t = I_t^{local} + I_t^{imported} \]
It is assumed that, if imported cases exist, they can be distinguished from local cases, for instance through epidemiological investigations, so that \(I_{t}^{local}\) and \(I_{t}^{imported}\) are observed at each timestep.
\[ \Lambda_t(w_s) = \sum_{s=1}^t (I_{t-s}^{local} + I_{t-s}^{imported}) w_s = \sum_{s=1}^t I_{t-s} w_s \]
\[ \mathbb E(I_t^{local} \mid I_0, I_1, \ldots, I_{t-1}, w_s, R_t) = R_t\Lambda_t (w_s) \]
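The two expressions above can be turned into a small numerical sketch. The Python code below is an illustration under stated assumptions, not the estimation code used for this site. It treats \(w_s\) as the serial interval probability at a lag of \(s\) days, assumes Poisson-distributed local cases with a Gamma prior on \(R_t\) (so the posterior is also a Gamma distribution), and reports the posterior mean rather than the median shown in the charts. The function names, the prior parameters and the example data are all hypothetical.

```python
def infection_pressure(local, imported, w):
    """Lambda_t = sum over s of (local[t-s] + imported[t-s]) * w_s for each t.
    `w[s - 1]` holds w_s; lags longer than len(w) carry zero weight."""
    total = [loc + imp for loc, imp in zip(local, imported)]
    return [
        sum(total[t - s] * w[s - 1] for s in range(1, min(t, len(w)) + 1))
        for t in range(len(total))
    ]


def posterior_mean_rt(local, imported, w, window=7,
                      prior_shape=1.0, prior_scale=5.0):
    """Posterior mean of R_t over a trailing window of `window` days.
    Assumes local cases ~ Poisson(R_t * Lambda_t) and a Gamma prior on R_t
    (both prior parameters are assumptions). Returns None where no full
    window is available or the infection pressure is zero."""
    pressure = infection_pressure(local, imported, w)
    estimates = []
    for t in range(len(local)):
        if t < window:  # the window must not include day 0, where Lambda is 0
            estimates.append(None)
            continue
        days = range(t - window + 1, t + 1)
        cases = sum(local[d] for d in days)
        exposure = sum(pressure[d] for d in days)
        if exposure == 0:
            estimates.append(None)
            continue
        estimates.append((prior_shape + cases) / (1 / prior_scale + exposure))
    return estimates


# Toy check: cases double every day and the serial interval is exactly one
# day, so each case should be estimated to infect about two others.
local_cases = [10 * 2 ** day for day in range(10)]
imported_cases = [0] * 10
rt = posterior_mean_rt(local_cases, imported_cases, w=[1.0])
print(rt[-1])
```

Note that only local cases are counted as new infections in the numerator, while both local and imported cases contribute to the infection pressure in the denominator, mirroring the expectation above.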
…to be completed.
Commentary
In these plots, the counts of incident cases with presumed local sources of infection have been inflated by a factor of 10, and the counts of cases with presumed overseas sources of infection inflated by a factor of 1.5. This mimics ten-fold under-ascertainment of locally-transmitted cases, and 50% under-ascertainment of inbound cases.
In fact, under-ascertainment has little effect on the effective reproduction number, which is driven by the number of cases each case infects, not the total number of cases or infectious individuals.
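One way to see why, as a sketch under a simplifying assumption: suppose only a constant fraction \(1/k\) of all cases is detected, so that every true count is \(k\) times the observed count (the symbol \(k\) is introduced here for illustration). The ratio of new local cases to infection pressure, which drives the estimate of \(R_{t}\), is then unchanged:

\[ \frac{k\, I_t^{local}}{\sum_{s=1}^t k\, I_{t-s} w_s} = \frac{I_t^{local}}{\sum_{s=1}^t I_{t-s} w_s} = \frac{I_t^{local}}{\Lambda_t(w_s)} \]

The cancellation is exact only when local and imported cases are inflated by the same factor. With different factors, as in the plots described above, it holds only approximately.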